Identifying Signatures of Selection 
in Genetic Time Series 



Alison Feder 1 '*, Sergey Kryazhimskiy 2 *, Joshua B. Plotkin 1 



department of Biology, University of Pennsylvania, Philadelphia, PA 

2 Department of Organismic and Evolutionary Biology and FAS Center for Systems Biology, Harvard University, Cam- 
bridge, MA 
* Equal contribution 



1 



Running Head: Selection in genetic time series 



Keywords: 

Corresponding author: 

Joshua B. Plotkin 

Department of Biology 

University of Pennsylvania 

Philadelphia 

PA 19104 

USA 

Phone 

Fax 

Email: jplotkin@sas.upenn.edu 



2 



Abstract 

We develop a rigorous test for natural selection based on allele frequencies sampled from a population 
over multiple time points. We demonstrate that the standard method of estimating selection coefficients 
in this setting, and the associated chi-squared likelihood-ratio test of neutrality, is biased and it therefore 
does not provide a reliable test of selection. We introduce two methods to correct this bias, and we 
demonstrate that the new methods have power to detect selection in practical parameter regimes, such as 
those encountered in fitness assays of microbial populations. Our analysis is limited to a single diallelic 
locus, assumed independent of all other loci in a genome, which is again relevant to simple competition 
assays of laboratory and natural isolates; other techniques will be required to detect selection in time series 
of co-segregating, linked loci. 
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1 Introduction 



Population geneticists typically seek to understand the forces responsible for patterns observed in contemporaneous 
samples of genetic data, such as the nucleotide differences fixed between species, polymorphisms within populations, 
and the structure of linkage disequilibrium. Recently, however, there has been a sharp increase in the abundance of 
dynamic data, where the frequencies of segregating alleles in an evolving population are monitored through time, 
both in laboratory experiments ( |HEGRENESS et al.\ |2006[ |BARRICK et al.\ |2009| |LANG et dL\ |2011| > and in natural 
populations ( |Barrett et aZ.||2008"j|REiD et aT\ |20 1 1 [ [Denef and Banfield||2012||Winters et <z/. j |2Q 1 2^ |D aniels | 
\et al.\ |2012| |PENNINGS et al. \ |2012| l. One of the important questions is whether the changes in allele frequencies 
observed in such data are the result of natural selection or are simply consequences of genetic drift or sampling artifacts. 
In principle, it would seem that dynamic data might provide researchers with more power to detect and quantify 
selective forces while avoiding the assumptions of stationarity that are required for many inference techniques based 
on static samples ( |Sawyer and Hartl]|1992||Desai and Plotkin| [20081 |Boyko et fl£||2008) . Nonetheless, the 
behavior and power of inference techniques based on time series of genetic data have not been thoroughly investigated. 

There is a well-developed literature on inferring population sizes from genetic time-series data assuming neutrality 
( |Pollak| [19831 |Waples| [T9891 |Williamson and Slatkin| [19991 |Wang| |200T) , but a much smaller literature 



on inferring natural selection from such time series (BOLLBACK et al. 2008 ILLINGWORTH and MUSTONEN 2011 



Illingworth et al] |2012| |Malaspinas et al.\ |2012[ |Mathieson and McVean[ |20131 >. Even the simplest case, 



which considers the dynamics of two alternative alleles at a single genetic locus, independent of all other loci, presents 



a number of statistical challenges that have not yet been addressed. BOLLBACK et al. (20081 fit two nested Wright- 
Fisher models to time-series data at a single locus (one model with selection, and one without), and rejected the neutral 
model using the x 2 distribution for the likelihood ratio statistic. Such an approach is generally the most powerful and 
unbiased, at least for large datasets. Nonetheless, here we show that in practice the actual frequency of false positives 
under this approach can vastly exceed the nominal P-value obtained from the \ 2 distribution - and especially so at 
more stringent P-value cutoffs. Since the \ 2 distribution does not provide an accurate representation of the false 
positive rate, this approach cannot be used to draw sound statistical conclusions about the presence of selection from 
such time series. The underlying reason for this problem is that the likelihood ratio statistic is x 2 -distributed only 



asymptotically, and convergence to this distribution is slow (WlLKS 1938 1. In most practical applications, such as 



when sampling from natural populations (Reid et al. 2011 Denef and BANFIELD 2012 WINTERS et al. 2012 



Daniels et al. 2012 Pennings et al. 2012 1 or competing two microbial strains ( Lenski et al. 1991 Bollback 



|and Huel senbeck] |2007[ l, the number of sampled time points is small (less than 10), and the distribution of the 
likelihood ratio statistic is far from x 2 under neutrality, leading to more false positives then expected. 

We propose two solutions to fix this problem, providing unbiased tests for natural selection in time-series data sam- 
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pled at a single genetic locus. First, we develop an algorithm for computing the exact distribution of the likelihood ratio 
statistic under neutrality. Although feasible in many regimes, this direct approach is computationally intensive. We 
also propose an alternative, computationally efficient, albeit approximate, statistical method for rejecting the neutral 
model. Our approach builds directly on the work of |BOLLBACK et aT] ( |2008| ), and it is likewise limited to studying time 
series of allele frequencies at a single haploid locus, assuming independence from all other loci. The far more com- 
plicated problem of detecting selection from genomic time series of many linked loci has received attention elsewhere 
( jlLLlNGWORTH and Mustonen]|2011| i, and the problems identified here likely apply to those situations as well. 

We start our presentation by introducing a likelihood framework for time-series data at a single genetic locus, 
and we then derive approximate expressions for these likelihoods under the Gaussian approximation to the Wright- 
Fisher model. We then demonstrate that the P-value given by the x 2 distribution for the likelihood ratio statistic 
underestimates the actual false-discovery rate. Finally, we introduce two methods to correct this bias and we verify that 
they are virtually unbiased, even for small sample sizes. We conclude by quantifying the power of these methods to 
detect selection in different parameter regimes, considering also sampling noise in the allele frequency measurements. 

2 Materials and Methods 

2.1 General expression for the likelihood of time series allele-frequency data 

Under any standard single-locus population-genetic model, the dynamics of an allele in a population are described by 
a Markov process that specifies the transition probability P s (x, t\x' , t') that the allele assumes frequency x at time t, 
given that its frequency was x' at some previous time point t' . This transition probability depends on two parameters 
of interest, the population size N and the selection coefficient s acting on the allele ( |EWENS||2004| l and possibly also 
on other nuisance parameters. Ignoring sampling noise, the likelihood of observing allele frequencies i/q, v\, . . . , i/l at 
times 0, ti, . . . , £i is 

L 

L(Data;iV,s) = U^YlPifaUWi-uti-i), (1) 

i=l 

and, under the neutral null hypothesis, 

L 

L(Data; N, 0) = Ufa) JJ P {v h UWi-i, U-i)- (2) 

Here U(x) denote the probability of observing allele frequency x at time point which for simplicity we set to be 
uniform on the interval (0, 1), i.e., U (x) = 1. 

Given an algorithm for computing expressions (0 and we can find the parameters values N and s that maximize 
the likelihood function (lib and the value N that maximizes the likelihood function |2b under the neutral null hypothesis. 
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We perform these maximization procedures using the golden section algorithm (KlEFER 1953 ) implemented in the 
Gnu Scientific Library (GSL) package. 



2.2 Computing the likelihood of time series data 



Under most population-genetic models it is impossible to exactly compute the likelihood expressions ([T]), (|2| because 
no exact analytic expressions for the transition probability P s (x, t\x' , t') are available for arbitrary x, x' , t, t'. Thus, 
various approximations have been proposed. The most well-known approximation to the Wright-Fisher and Moran 
models is the diffusion approximation of Kimura and others (Ewens 2004). Although considerably simpler than the 
discrete Wright-Fisher or Moran models, the diffusion equation is still too difficult to solve exactly and efficiently in a 
general case. Although some methods are available ( |KlMURA||1955a|b[|EVANS et q"T||2007[|BOLLBACK et al.\\20QS 



SONG and STEINRUCKEN 2012 1, they are rather cumbersome to implement numerically. 

We therefore resort to the Guassian approximation which is less general than the diffusion approximation but allows 
us to compute the transition probability efficiently, and quite accurately provided the allele has not been lost or fixed 
during the timescale of observation. We describe this approximation in detail in Appendix A. Briefly, if the time scale 
of observation is short compared to N, i.e., if absorption events can be ignored, the Moran process can be approximated 
by a sum of a deterministic process g and a Gaussian noise process Z ( Pollett]|1990 ). In the absence of genetic 
drift, i.e., when N — > oo, the allele frequency X behaves deterministically, X —> g(t, Xq), where g satisfies the logistic 
equation 



9 = sg(l - g), 
5(0, x ) = x , 



(3) 
(4) 



whose solution is 



g(t,x Q ) = x Q (x + (1 - x )e st ) . 



(5) 



Here xq is the initial deterministic allele frequency. When N < oo, genetic drift perturbs the allele frequency X from 
its deterministic value and so X(t) — g(t, xq) + Z(t) where Z(t) is the noise process. Then for any two time points 
t > and t' > 0, the transition probability is approximated by 



P s (x,t\x' ,t') 



N 



27TCT 2 (At,g') 



exp ■ 



-N- 



■g-(x'-g>)M(At,g')y 
2a 2 (At, g') 



(6) 
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where 



M(At,0 = e- sAt (£ + (l-Oe- sAt ) , (7) 
«r a (At,0 = M a (At,0(2 + *)£(l-0* _1 



2£s(l - C)At + £ 2 e sAt - (1 - £)V sAt + ((1 - - 



(8) 



and we used shorthands g = g(t, xq), g' = g(t' , xq), At = t — t' . Under the neutral null hypothesis (i.e., when s = 0), 
the transition probability simplifies to 



/ N \ (x-x'S 2 I 

P (x,t\x' « J -expl-N \ ' , (9) 

y 2ira£{At,xo) [ M( A ^ x o) J 

with 

al(At,x ) = 2x (l-x )M. (10) 

Note that functions (|5]l-([T0]i depend on three parameters, N, s and xq. xq is a nuisance parameter that in principle can 
be fitted along with N and s. However, for the sake of reducing the number of fitted parameters, we fix xo to be equal 
to the observed allele frequency at time zero, xo = vq. 

We assumed here and for the remainder of the paper that time is measured in generations ( |EWENS||2004| l. If we 
measure time in physical units, equations (|5|l-(fT0]i still hold, albeit with rescaled parameters N — > Nt, and s — > s/r, 
where t is the generation time. 



3 Results & Discussion 

We consider the problem of determining whether selection has played a role in shaping the fluctuations in the observed 
frequencies of an allele in a population over time. Suppose that at time points < t\ < ■ ■ ■ < we sample 
Uq, ni, • • • j nL individuals from a given population of an unknown size N and find that bi of them carry allele A at the 
locus of interest. Thus, we observe sampled allele frequencies vq = bo /no, v\ = bi/n%, . . . , vl = bi/nL- We are 
concerned with the question of whether genetic drift and sampling alone are sufficient to explain the fluctuations in the 
observed frequency of allele A, or whether these frequency changes arose due to the action of natural selection either 
at the specified or another, completely linked, locus. Specifically, we consider the following pair of nested hypotheses. 
Under the neutral null hypothesis, changes in allele frequency are caused only by genetic drift and sampling, i.e., the 
selection coefficient s acting on allele A is fixed to be zero. Under the alternative hypothesis, there are no restrictions 
on s. 

We initially ignore sampling noise. In other words, we assume that m 3> 1, 1 <C h <C rij for all i, so that the 
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sampled allele frequencies ^ accurately represent the actual frequencies in the population. We later investigate how 
sampling noise affects our conclusions. 



3.1 Likelihood ratio statistic is not x 2 distributed for finite data 

Typically, testing the neutral null hypothesis involves constructing the likelihood ratio statistic (LRS) defined by 

i ? (Data).21o g f L i Dat ^V (11) 

where L(Data; N, s) and L(Data; N, 0) are the likelihoods of the data under the alternative and the null hypotheses, 
respectively; N and s maximize i(Data; N, s); and N maximizes L(Data; N, 0) (see Materials and Methods for 
details). The Neyman-Pearson lemma guarantees that the LRS defines the most powerful test of a given size ( |Stuart| 



et al. 2009 1. In other words, the test that rejects the null hypothesis whenever i?(Data) > H a , with H a chosen so 
that the probability of a Type I error is a, has the lowest probability of Type II error among all tests that have the same 
probability of Type I error, a. 

However, the Neyman-Pearson lemma holds only for simple null hypotheses, i.e., when the distributions of the 
observed random variables do not depend on parameters. In our problem, the null hypotheses is composite since the 
distributions of allele frequencies depend on an additional parameter, N, whose value is unknown. Importantly, this 
implies that the distribution of the likelihood ratio statistic is unspecified under the null hypothesis. Thus, not only is the 
likelihood ratio test not guaranteed to be the most powerful, there is no general way of determining the critical regions 
for the LRS distribution. The standard way to circumvent the latter problem is to use the asymptotic distribution for the 
LRS. When the number of data points approaches infinity, the LRS distribution converges to the \ 2 distribution with 



one degree of freedom, under appropriate regularity assumptions ( WlLKS 1938 1. This approach has been previously 
used in the context of allelic time series by |BOLLBACK et al. (2008 ). 



Although the LRS is guaranteed to be asymptotically \ distributed, the rate of convergence to this distribution 



1938). We set out to determine how well the 



is O (L -1 / 2 ) where L is the number of sampled time points (WlLKS 
X 2 distribution approximates the true distribution of LRS when the number of data points is finite. This question is 
important because the use of an incorrect null distribution could result in a test that underestimates the fraction of Type 
I errors and thus erroneously rejects the null hypothesis more often than indicated by its P-value. 

With this goal in mind, we simulated the neutral two-allele Wright-Fisher model with population size N, without 
mutation, with allele A initiated at 50%. We recorded the allele frequency every generation for T generations. We 
then produced a dataset consisting of these frequencies sampled every A generations. We considered a total of L + 1 
sampled time points, so that A = T/L. In order that the Gaussian approximation used to compute likelihoods is 
accurate, we made sure that absorption events are rare by setting T < N / 10. For each population size N, we simulated 
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a 




N 


T 


L 


A 


N/N 


5% 


1% 


0.1% 


0.01% 


10 4 


10 


10 


10" 


1.25 


1.41 


1.74 


2.31 


3.25 


10 4 


100 


10 


10 1 


1.26 


1.44 


1.78 


2.40 


3.43 


10 4 


1000 


10 


10 2 


1.29 


1.64 


2.37 


4.26 


8.08 


10 3 


100 


10 


10 1 


1.29 


1.70 


2.49 


4.54 


8.26 


10 4 


1000 


10 


10 2 


1.29 


1.64 


2.37 


4.26 


8.08 


10 5 


10000 


10 


10 3 


1.31 


1.23 


1.54 


2.26 


3.58 


10 4 


100 


5 


20 


1.69 


2.25 


3.67 


7.29 


13.93 


10 4 


100 


10 


10 


1.29 


1.70 


2.49 


4.54 


8.26 


10 4 


100 


100 


1 


1.07 


1.32 


1.84 


3.73 


9.66 



Table 1: We report the ratio of the observed fraction of false positives a b s in the likelihood ratio test to the fraction a 
expected under the assumption that LRS is ^-distribution with 1 df, across a range of parameters (rows) and a range 
a values (columns 6-9). The P-values a expected under the \ 2 distribution are indicated in the top row. In the fifth 
column we also report the ratio of the population size most likely under the neutral null hypothesis N to its true value 
of N. See text for other notations. 



10 4 allele-frequency trajectories. For each simulated allele-frequency time series we computed the LRS as described 
in Materials and Methods. Thus, for each combination of N, L and T we obtained the true distribution of LRS under 
the neutral null hypothesis. We calculated the critical values of LRS for a range of P-values, a, assuming that the LRS 
is distributed as x 2 with one degree of freedom (df) and obtained the fraction of resulting false positives a b s for each 
expected P-value. 

The results shown in Table [j] Figure [l] and Supplementary Figure 1 demonstrate that the x 2 distribution is a poor 
approximation for the true distribution of LRS under neutrality, when the number of sampled time points is finite. In 
particular, the P-value calculated from the x 2 distribution with 1 df systematically underestimates the number of false 
positives a b s even with as many as 100 sampled time points. While this discrepancy is moderate (less than a factor 
of 2) for relatively high nominal P-values (e.g., above 1%) it becomes increasingly more severe for more stringent 
P-values, so that in some regimes the \ 2 test will reject neutrality almost 10-times as often as it should (see Table[l]i. 

As the third section of Table [T] shows, when the number of sampled points increases (and A decreases), with all 
other parameters held constant, the x 2 P-value approaches an accurate representation of the true rate of false positive 
- that is, the ratios shown in the table approach 1.0. This result reflects the (slow) convergence of the LRS distribution 



to the x 2 distribution with increasing amount of data, as guaranteed by the classical result of WlLKS ( 1938 1. 

In addition to the deviation of the true LRS distribution from the x 2 distribution, the most likely value of N under 
the null hypothesis, N, systematically overestimates true TV, especially when the number of data points is small (see 
Table [TJ. This upward bias in estimating population sizes has been previously reported by several authors ( |Waples") 
1989||WlLLlAMSON and Slatkin||1999||Wang[|2001| >. As expected, the bias in the estimation of N decreases as 
the number of data points increases, independent of the true population size or the number of generations represented 
in the sample (Supplementary Figure 1). 



9 



LRS LRS LRS FIS 

X 2 distribution with 1 df Distribution under true N Distribution under N Student's f-distribution 




I 1 1 1 1 1 I 1 1 1 1 1 l 1 1 1 1 1 I 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 6 0.8 1.0 

Quantile 



Figure 1: Comparison of distributions of test statistics with their approximations. Shown is the probability that 
a test statistic falls within each quantile of a given distribution. Panels A-C. The probability distribution for LRS to 
fall into each quantile of (A) the \ 2 distribution with 1 df; (B) the distribution under the true N; (C) the empirical 
LRS distribution under N, The LRS statistic falls into the top quantiles of the x 2 distribution with 1 df more often 
than expected which indicates that the P-value given by the \ 2 distribution underestimates the true probability of a 
Type I error. The distribution of LRS under N closely approximates the LRS distribution under true N. Panel D. 
The probability distribution for the FIS to fall into each quantile of the Student's i-distribution with L df. Student's 
t distribution is a good approximation for the true distribution of the t-statistic. Parameter values used: N — 10 3 , 
A = 20, L = 5. We ran 200,000 and 300,000 simulations to estimate the LRS distribution under true N and under N, 
respectively. 

3.2 "Empirical" likelihood ratio test 

In order to fix the shortcomings of the \ 2 test for selection we introduced two new methods. The most direct approach 
would be to compare the LRS statistic observed in the data to the true distribution of the LRS statistic assuming 
neutrality. However, this approach is not possible, because in practice we do not know the true value of the population 
size N under which to generate the distribution of the LRS. Therefore, as an intuitive approximation, we propose first 
to estimate the most-likely population size assuming neutrality, N, to generate, via neutral Wright-Fisher simulations, 
the empirical LRS distribution with N as the population size, and to compare the LRS observed in the data against 
this empirical LRS distribution to determine whether or not to reject neutrality. We call this approach the empirical 
likelihood ratio test (ELRT). 

Figure [Tj shows that the empirical LRS distribution is an excellent approximation to the true LRS distribution 
even when the number of samples is small. Although the P-values computed with the empirical LRS distribution 
are accurate, this approach is relatively computationally intensive because many Wright-Fisher simulations must be 
performed to obtain the empirical distribution, and each simulation must be accompanied by the calculation of the 
LRS. To reduce the computational burden, in the next section we propose an alternative to the likelihood ratio test that 
is computationally extremely easy, but is somewhat less accurate. 
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3.3 "Frequency increment" test 

We define the "frequency increment" (FI) statistic as 

Y 

t F i(Data) = . (12) 
where Y and S are the sample mean and the sample variance 

y = iX> and s 2 = -L-£(y,-y) 2 

»=1 i=l 

of the rescaled allele frequency increments 

Vi — Vi^i 

Y% = , , i = 1, 2, . . . , L. 

v /2i/ i _ 1 (l - - 

Since under the neutral null hypothesis the allele frequency v behaves approximately as Brownian motion as long as 



it is far from the boundaries or 1 (see EWENS (2004i and Appendix), the random variables Yi are independent and 



approximately normally distributed with mean and variance 1 (see equation ( 10 1). Thus, the distribution of the FI 
statistic under neutrality can be approximated by the Student's t-distribution with L degrees of freedom. We call this 
test the frequency increment test (FIT). 

Figure [T] and Table [2] show that the test of neutrality based upon the FI statistic coupled with the Student's t- 
distribution substantially outperforms the likelihood ratio test with the \ 2 distribution - in the sense that it is far less 
biased. Nevertheless, the nominal P-value based on the i-distribution with L df does not always accurately reflect the 
true probability of Type I error. Under most parameter regimes where the probability of Type I error deviates from the 
P-value reported by the i-distribution, the FI test is overly conservative (i.e., the reported P-value overestimates the 
probability of Type I error), but the precise dependence involves N, L, A in a complicated way (Table |2j. In any case, 
the deviation in the size of this test is an order of magnitude smaller than that of the LRT, in all parameter regimes 
tested. 



3.4 Power of ELRT and FIT to detect selection 

Next we tested whether the empirical likelihood ratio test (ELRT) and the frequency increment test (FIT) have power to 
detect selection in allele-frequency data. To this end, we ran Wright-Fisher simulations as described above but now with 
one of the two alleles having selective advantage s. For each N and s we simulated 10 4 allele-frequency trajectories. 
As before, we sampled L + 1 frequencies of the selected allele spaced A generations apart. For each combination of 
N, s, L and A we performed ELRT and FIT, and we calculated how often each test reports a significant outcome for a 
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a 


N 


T 


L + l 


A 


5% 


1% 


0.1% 


0.01% 


10 4 


8 


5 


2 


1.00 


1.02 


1.03 


1.10 


10 4 


98 


50 


2 


0.99 


0.97 


0.93 


0.90 


10 4 


998 


500 


2 


0.83 


0.69 


0.46 


0.28 


10 4 


9 


10 


10° 


1.00 


1.01 


0.97 


1.01 


10 4 


90 


10 


10 1 


0.99 


1.00 


1.01 


1.06 


10 4 


900 


10 


10 2 


0.96 


1.05 


1.22 


1.24 


10 3 


90 


10 


10 1 


0.99 


1.10 


1.34 


1.47 


10 4 


900 


10 


10 2 


0.96 


1.05 


1.22 


1.24 


10 5 


9000 


10 


10 3 


0.53 


0.42 


0.21 


0.12 



Table 2: We report the ratio of the observed fraction of false positives a b s in the FI test to the fraction a expected 



under the assumption that the FI statistic ( 12 1 is distributed according to Student's t -distribution with L df, across a 
range of parameters (rows) and a range a values (columns 5-8). The P-values a are indicated in the top row. See text 
for other notations. 



given P-value cutoff. 

As shown in Figure [2] both the ELRT and the FIT have substantial power to detect the action of natural selection 
when selection is moderate to strong (Ns > 10), with FIT having slightly more power than the ELRT. The power 
of both tests strongly depends on the amount of time that the allele frequency is observed: longer observations yield 
substantially more power. The power to detect selection depends only weakly on the number of sampled time points 
and on the population size, all other parameters being equal. 



3.5 The effect of sampling 

So far we demonstrated that we can accurately calculate the sizes of the empirical likelihood ratio test and the frequency 
increment test (FIT) and that these tests have substantial power to detect selection in an idealized situation when the 
exact allele frequency in the population is known. We now investigate the behavior of these tests in a more realistic 
situation when the allele frequency is measured by taking a sampling and typing a finite number of individuals from 
the population. We used the same time-series trajectories as in previous sections, but now instead of analyzing the 
true allele frequencies x we drew binomial random variables with sample size n and success probability x to obtain 
the sampled allele frequencies v. We then analyzed the test size and the test power using sampled allele frequencies 
v as the data. Figure [3] shows that the P-values estimates for both ELR and FI tests remain unbiased when the allele 
frequencies are sampled. However, as expected, the power of both tests to reject neutrality declines as the number of 
samples decreases (Figure 0). If the number of samples falls below 100 per sampled time point, i.e., when sampling 
noise becomes strong, we lose the ability to detect selection unless the selection coefficient is very high (e.g., Ns > 20) 
or the time series is very long. 



12 



L = 5 



L = 10 



L = 50 



CO 

o 



CO 

o 



CD 
O 

CL "* 

o 



CM 

o 



-- FIT, T=.1N 


A 


FIT, T=.01N 




ELRT, T = .1N 


/ 


- ELRT, T = .01N 


I 

I 




i 




I 

/ 




I * 




1 * 
I 1 








* / 




/ 




/ 




I I 
1 2 


1 1 

10 20 



B 




T" 




1 2 10 

Selective Pressure (A/s) 



l r 

20 1 2 



10 



— r 

20 



Figure 2: Power of ELRT and FIT as a function of strength of selection. Power is the fraction of trail data sets 
generated by the Wright-Fisher model with selection for which ELRT (yellow lines) or FIT (blue lines) rejects the 
neutral null hypothesis at P-value a = 0.05. Dashed (solid) lines indicate data sets where allele frequency was 
sampled for T = N/10 (T — AT/100) generations. Each point represents an average over 10 3 trials, with A~ = 10 4 . 



LRS LRS LRS FIS 



X 2 distribution with 1 df Distribution under true N Distribution under N Student's f-distribution 




I 1 1 1 1 1 I 1 1 1 1 1 I 1 1 1 1 1 I 1 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Quantile 



Figure 3: Comparison of distributions of test statistics with their approximations, allele frequencies are sampled 

Shown is the probability that a test statistic falls within each quantile of a given distribution when allele frequencies 
are sampled. Panels A-C. The probability distribution for LRS to fall into each quantile of (A) the distribution under 
the true A^; (B) the \ 2 distribution with 1 df; (C) the LRS distribution under AT. Panel D. The probability distribution 
for the FIS to fall into each quantile of the Student's t-distribution with L df. Student's t distribution is a good 
approximation for the true distribution of the t-statistic. Parameter values used: AT = 1000, A = 20, L — 5, n = 500. 
We ran 200,000 and 300,000 simulations to estimate the LRS distribution under true A^ and under A^, respectively. 
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Figure 4: Power of ELRT and FIT as a function of strength of selection under various sampling regimes. Each 
point represents an average over 10 3 trials, except for ELRT with 10 4 subsamples, which used only 2, 500 trials. 

4 Conclusions 



We have demonstrated that the standard, \ 2 -based test for selection in time series of allele frequencies (BOLLBACK 
\et al. | [2008] ) is subject to a greatly elevated false discovery rate, in the practical regime of relatively few sampled time 
points. As a result of this bias, the LRT is not reliable for testing selection in many practical time series, because 
its P-value underestimates the rate of false inferences of selection. We have presented two methods to address this 
problem, developing likelihood-based tests for selection that avoid the \ 2 assumption. We have shown that both our 
methods are unbiased and have power to detect selection in parameter regimes that are reasonable for many microbial 
experiments and natural populations. 

Specifically, our results indicate that our proposed methods should be able to detect selection in the ranges relevant 
for many microbial and viral populations - that is N > 10 4 , s > 0.001, and possibly even smaller selection coefficients 
in larger populations. These techniques, just as the x 2 based techniques, will be limited to the simple situations in 
which the frequencies of alleles observed at a locus are not influenced by mutations that possibly arise elsewhere in 
the genome, during the time of observation. This simple situation, in which clonal interference is absent, applies to 



the competitive fitness assays used in evolution experiments (LENSKI et al. 1991 GALLET et al. 2012 1 or to tracking 
known polymorphisms in natural populations for a relatively short time ( BARRETT et al. 2008 , WINTERS et al. 2012 
DANIELS et al. 2012| PENNINGS et al. 2012 1. By contrast, inferring selection coefficients when allele dynamics 
are influenced by multiple linked sites is a substantially more difficult problem, which has begun to be addressed 
elsewhere ( jlLLlNGWORTH and Mustonen] |2011| |ILLINGWQRTH et al~\ |2012[ >, although not, of course, with an 
rigorous likelihood framework for the dynamics of all possible genotypes. 
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5 Appendix A. Gaussian approximation to the Moran process 

We approximate the continuous -time Moran processes with a combination of a deterministic process and Guassian 



noise process. We follow here the procedure outlined by POLLETT ( 1990 1. The Moran's stochastic process describes 



the number n'"' (t) of mutants in a population of constant size N at time t. This number can increase by one from i to 
i + 1 with rate 

r w (M+i)=w , ■ 

\ w (Jy —t) + A m i 



and decrease by one with rate 



\ W (N - i) + X m i 

Here, fj, w and are the birth and death rates of the wildtype, and fi m and A m are the birth and death rates of the 
mutant type, respectively. We assume X w — A m , /j, w — 1, and let \i m = (1 + s)fi w = 1 + s. Then 

r ( - N \i,i + l)=Nf +1 (i/N), r^(i,i - 1) = Nf-^i/N) (13) 



with 



Define 



f +l {x) = (1 + s)x(l - x), /_i(a;) = x(l - x). 



F(x) = Yl Sf s {x)=sx(l-x) 
se{-i,+i} 

G{x) = S 2 f s (x) = (2 + s)x{l-x). 

se{-i,+i} 

Let X( N \t) = rS N \t)/N be the frequency of the mutant in the population at time t. The limit of X^ N \ g(t, xq) — 
limjv^oo X^ (t), is a deterministic function that, under certain regularity conditions, satisfies equations ([3J, Q with 
£0 = liniw-s-oo X( N *>(Q) and solution given by pV 
Now let 

Z(i) = lim ^(xW(t)-g(t,x )) (14) 

be the asymptotic process that describes the noise around the deterministic trajectory. If we knew the distribution of 
Z(t), we could approximate the frequency at a finite iV" by 

XC)(i)^(t, IO ) + i% (15) 
v iv 
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The asymptotic noise process is in general a diffusion process, but, as long as it remains far from absorbing bound- 
aries, it can be approximated by a Gaussian process with the corresponding first two moments. The advantage of this 
approach is that the first two moments of the diffusion process can be computed analytically, resulting in an expression 
for the probability distribution of the allele frequency at time t. 

If zq = limjv^oo V ~N(XW (0) — Xq) is the initial value of the limiting noise process, then the mean and variance 
of the noise process at time t > are EZ(t) = M(t, Xq)z and Var Z(t) = a 2 (t, Xq) respectively, where M(t, Xq) 
satisfies the equations 

M(0,x ) = 1 (17) 
and <r 2 (t, Xo) satisfies the equations 

^ = 2F'(g(t,x ))a 2 + G(g(t,x )) = 

(1 - .T )e- St - Xq 2 (2 + S)X Q {1 - X Q )e~ st 

(l-x )e-^+x ' + ((1 - x )e-st + x )- 2 K *> 

a 2 (0,x ) = 0. (19) 

The solution to equations ( [To*} , ( [T7| is given by 

M(t, x ) = exp ( / F'(g(T, xo))d T } , 

that, after substituting F' and g, yields 

M(t,x ) = e- st (x + (1 - xo)e- st y 2 . 

The solution to equations ( fl8] >, ( [T9| is given by 

a 2 (t,x ) =M 2 {t,x ) f M- 2 (T,x Q )G(g(T,x Q ))dT, 
Jo 

that, after substituting G and g, yields 

a 2 (t,x ) = M 2 (t,x Q )(2 + s)x (l-x ) S - 1 

x 2x s(l - x )t + x 2 e st - (1 - x ) 2 e~ st + ((1 - x ) 2 - x 2 ) 
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If the true state of the stochastic process is known to be X^ N \0) at the time point 0, we can approximate the 



initial value of the limiting noise process as z ~ y/~N (X^ n ' (0) — Xq). Then from ( 15 1 we have 



Analogously, if the value of the process X^ N > is known to be X^ N ' (t r ) at a later time £' > x , then at time t > t' we 
have 

E t ,xW(t) « ff (^o)+M(At, J (t' lI o))(x W W-9(<'^)) 1 (20) 
Var t , XW(f) a l^Ai,^',^)), (21) 

where At = t — t', and E t / and Var t / denote conditional expectation and variance given the state of the process 
at time f '. Thus, the conditional distribution of the allele frequency X^ at time t given its value at time t' < t 
can be approximated by a Gaussian distribution with mean given by ( [20) ) and variance given by pi) . We apply 
this approximation to every observation interval (£t-i,ti),£ = 1, ■ • ■ ,L. As noted above, the initial value of the 
deterministic process, x n , is a free parameter that can be fitted along with and s. However, we set xq to be equal to 
the observed allele frequency v at time in order to reduce the number of fitted parameters. 

Note that the approximations described here work for the Moran process which is density-dependent as can be seen 
from equations ( [13) , The Wright-Fisher process is not density-dependent and, strictly speaking, the approximations 
described here are not valid, although in practice they work well. 
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T=N/10 



L = 10 




L trial length (T/N) 

Supplementary Figure 1: Bias in the maximum-likelihood estimate of population size, under neutrality. Shown 
is the ratio of the ML estimated population size N to the true population size N as a function of the number of sampled 
points L (left panel) and as a function of the length of the observed time series LA (right panel). Bars indicate ±2 
standard deviation of the distribution of N/N. As expected, bias in N decreases as the number of sampled time points 
increases. The bias is almost independent of N and the length of the sampling period. Parameters used: (A) 100,000 
trials for each N, 1000 points displayed, A = 1; (B) 100,000 for each L/N with N = 10 4 , A = 1. 
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